On the Reliability of RAID Systems: An Argument for More Check Drives

Authors

  • Sarah Edge Mann
  • Michael Anderson
  • Marek Rychlik
Abstract

In this paper we address issues of reliability of RAID systems. We focus on “big data” systems with a large number of drives and advanced error correction schemes beyond RAID 6. Our RAID paradigm is based on Reed-Solomon codes, and thus we assume that the RAID consists of N data drives and M check drives. The RAID fails only if the combined number of failed drives and sector errors exceeds M, a property of Reed-Solomon codes. We review a number of models considered in the literature and build upon them to construct models usable for a large number of data and check drives. We attempt to account for a significant number of factors that affect RAID reliability, such as drive replacement or lack thereof, mistakes during service such as replacing the wrong drive, delayed repair, and the finite duration of RAID reconstruction. We evaluate the impact of sector failures that do not result in drive replacement. The reader who needs to consider large M and N will find applicable mathematical techniques concisely summarized here, and should be able to apply them to similar problems. Most methods are based on the theory of continuous time Markov chains, but we move beyond this framework when we consider the fixed time to rebuild broken hard drives, which we model using systems of delay and partial differential equations. One universal statement is applicable across various models: increasing the number of check drives in all cases increases the reliability of the system, and is vastly superior to other approaches of ensuring reliability such as mirroring.
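The abstract's central claim, that adding check drives increases reliability, can be illustrated with a very small continuous-time Markov chain. The sketch below is not the authors' model. It assumes exponentially distributed drive lifetimes with rate `lam`, independent parallel repairs with rate `mu` per failed drive, no sector errors, and no service mistakes. The function name `mttdl` and all parameter names are hypothetical. The state is the number of failed drives, and data loss occurs when that number first exceeds M.

```python
def mttdl(n_data, m_check, lam, mu):
    """Mean time to data loss for a birth-death chain on failed-drive counts.

    Assumptions (illustrative only, not taken from the paper):
      - state k = number of failed drives, k = 0..m_check
      - failure rate out of state k is (n_data + m_check - k) * lam
      - repair rate out of state k is k * mu (independent parallel repairs)
      - data is lost on reaching state m_check + 1
    """
    total = n_data + m_check
    expected_total = 0.0
    tau_prev = 0.0  # expected time to first move from state k-1 to state k
    for k in range(m_check + 1):
        fail_rate = (total - k) * lam
        repair_rate = k * mu
        # First-passage recursion: tau_k = (1 + r_k * tau_{k-1}) / f_k
        tau_k = (1.0 + repair_rate * tau_prev) / fail_rate
        expected_total += tau_k
        tau_prev = tau_k
    return expected_total


if __name__ == "__main__":
    # Hypothetical rates: one failure per 100,000 hours, 10-hour mean repair.
    for m in range(0, 5):
        print(m, mttdl(n_data=10, m_check=m, lam=1e-5, mu=1e-1))
```

With one data drive and one check drive the recursion reduces to the familiar two-drive result (3λ + μ) / (2λ²), which is a convenient sanity check. Under these simplified assumptions, each additional check drive multiplies the mean time to data loss by a factor on the order of μ/λ, in line with the abstract's qualitative conclusion.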

Similar resources

Disk Array Storage System

Fault tolerance requirements for near term disk array storage systems are analyzed. The excellent reliability provided by RAID Level 5 data organization is seen to be insufficient for these systems. We consider various alternatives (improved MTBF and MTTR times as well as smaller reliability groups and increased numbers of check disks per group) to obtain the necessary improved reliability. The...

High-fidelity reliability simulation of XOR-based erasure codes

Erasure codes are the means by which storage systems are typically made reliable. Recent high profile studies of disk failure and sector failures indicate that ever more fault tolerant erasure codes are needed. Many traditional RAID approaches, parity-check array codes (e.g., EVENODD, RDP, and X-code), and MDS codes offer two and three disk fault tolerant schemes. There are also many novel erasu...

Improving Storage System Reliability with Proactive Error Prediction

This paper proposes the use of machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures, where individual sectors on a drive become unavailable, and occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through redundancy in the system (e.g. anot...

Proactive error prediction to improve storage system reliability

This paper proposes the use of machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures, where individual sectors on a drive become unavailable, and occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through redundancy in the system (e.g. anot...

Exploring the performance impact of stripe size on network attached storage systems

Network Attached Storage (NAS) integrates Redundant Array of Independent Disks (RAID) subsystem that consists of multiple disk drives to aggregate storage capacity, I/O performance and reliability based on data striping and distribution. Traditionally, the stripe size is an important parameter that has a great influence on the RAID subsystem performance, whereas the performance impact has been ...

Journal title:
  • CoRR

Volume abs/1202.4423

Publication date: 2012